Data Description:
The file Bank.xls contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.
Domain: Banking
Context: This case is about a bank (Thera Bank) whose management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors). A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio with minimal budget.
Attribute Information:
- ID : Customer ID
- Age : Customer's age in completed years
- Experience : Years of professional experience
- Income : Annual income of the customer (USD 000)
- ZIP Code : Home address ZIP code
- Family : Family size of the customer
- CCAvg : Average spending on credit cards per month (USD 000)
- Education : Education level
- Mortgage : Value of house mortgage, if any (USD 000)
- Personal Loan : Did this customer accept the personal loan offered in the last campaign?
- Securities Account : Does the customer have a securities account with the bank?
- CD Account : Does the customer have a certificate of deposit (CD) account with the bank?
- Online : Does the customer use internet banking facilities?
- CreditCard : Does the customer use a credit card issued by UniversalBank?

Objective:
The classification goal is to predict the likelihood of a liability customer buying personal loans.
Steps and tasks:
#Numerical operations
import numpy as np
from scipy.stats import norm
#Data Exploration
import pandas as pd
#Data Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style("darkgrid")
#Data Preparation & Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
#Model Building - Classifiers
import statsmodels.api as sm # Logit
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
#Model Selection & Evaluation
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix, log_loss, roc_curve,precision_recall_curve
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score
#Suppress warnings
import warnings
warnings.filterwarnings('ignore')
bankData = pd.read_csv("/content/Bank_Personal_Loan_Modelling.csv")
#Visualising the top 5 records
bankData.head()
#Visualising the bottom 5 records
bankData.tail()
print("No. of rows of the Data set : ", bankData.shape[0])
print("No. of columns of the Data set : ", bankData.shape[1])
bankData.info()
#Checking the Missing Values for all the features
bankData.isnull().any()
#Re-ordering the target variable 'Personal Loan' as the last column for our convenience
temp = bankData['Personal Loan']
bankData.drop(['Personal Loan'], axis = 1, inplace=True)
bankData['Personal Loan'] = temp
bankData.head()
bankData.describe().T
Observations from Dataset Columns:
Input Variables
Input Variables: ID, Age, Experience, Income, ZIP Code, Family, CCAvg, Education, Mortgage, Securities Account, CD Account, Online, CreditCard
Dependent/Target Variable: Personal Loan
#Black Curve: Normal Distribution; Blue: Variable Distribution
sns.distplot(bankData["Age"], rug = True, fit=norm);
#Histogram with Mean, Median and Mode highlighted
plt.figure(figsize = (10,6))
plt.axvline(bankData["Age"].mean(), color='r', linestyle='--')
plt.axvline(bankData["Age"].median(), color='g', linestyle='-')
plt.axvline(bankData["Age"].mode()[0], color='y', linestyle='-')
plt.legend({'Mean':bankData["Age"].mean(),'Median':bankData["Age"].median(),'Mode':bankData["Age"].mode()[0]})
plt.xlabel("Age")
plt.hist(bankData["Age"], bins = 30)
plt.show()
print("Skewness of Age variable is : ", bankData["Age"].skew().round(2))
#Outliers check using Box Plot & IQR
sns.boxplot(x = 'Age', data = bankData);
Q1 = bankData["Age"].quantile(0.25)
Q3 = bankData["Age"].quantile(0.75)
IQR = Q3 - Q1
print("Inter Quartile Range of Age variable : ", IQR)
age_outliers = np.where((bankData["Age"] < (Q1 - 1.5 * IQR)) | (bankData["Age"] > (Q3 + 1.5 * IQR)))
print("Outliers beyond 1.5 x IQR for Age are : ",age_outliers[0])
print("\n")
print("Number of outliers for Age are : ", len(age_outliers[0]))
print("Percentage of outliers for Age are : ", round((len(age_outliers[0]) / len(bankData["Age"]) * 100),2), "%")
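Since the same IQR check is repeated below for each numeric column, it could be consolidated into a small helper. A sketch (`iqr_outlier_report` is a hypothetical name, not part of the original notebook):

```python
import numpy as np
import pandas as pd

def iqr_outlier_report(series, k=1.5):
    """Return positional indices of values beyond k x IQR, plus their percentage."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    idx = np.where(mask)[0]
    pct = round(len(idx) / len(series) * 100, 2)
    return idx, pct
```

Calling `iqr_outlier_report(bankData["Age"])` would then reproduce the numbers printed above in one step.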
Observations on Age:
Age is normally distributed with very minimal skewness.
#Black Curve: Normal Distribution; Blue: Variable Distribution
sns.distplot(bankData["Experience"], rug = True, fit=norm);
bankData["Experience"].skew()
#Treating negative values in Experience column
print("Total Number of Negative values in Experience : ",bankData[bankData['Experience']<0]['Experience'].value_counts().sum())
print("Negative values in Experience and their frequency : \n",bankData[bankData['Experience']<0]['Experience'].value_counts())
Imputing the negative values with the median (for Experience, mean ≈ median and skewness is minimal)
#Replacing the negative values with NaN
bankData['Experience'].replace( to_replace= -1,value = np.nan,inplace = True )
bankData['Experience'].replace( to_replace= -2,value = np.nan,inplace = True )
bankData['Experience'].replace( to_replace= -3,value = np.nan,inplace = True )
#Imputing the NaN values with Median
bankData['Experience'].fillna(bankData['Experience'].median(),inplace=True)
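The three `replace` calls above can also be written as a single vectorized mask, which covers any negative value rather than listing -1, -2 and -3 individually. A sketch on a toy frame (`df` is a stand-in for `bankData`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for bankData
df = pd.DataFrame({'Experience': [5, -1, 10, -2, 8, -3]})
df.loc[df['Experience'] < 0, 'Experience'] = np.nan                    # mark all negatives as missing
df['Experience'] = df['Experience'].fillna(df['Experience'].median())  # impute with the median
```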
bankData.describe().T #Updated Five Number Summary
#Rechecking the distribution post Imputation
sns.distplot(bankData["Experience"], rug = True, fit=norm);
#Histogram with Mean, Median and Mode highlighted
plt.figure(figsize = (10,6))
plt.axvline(bankData["Experience"].mean(), color='r', linestyle='--')
plt.axvline(bankData["Experience"].median(), color='g', linestyle='-')
plt.axvline(bankData["Experience"].mode()[0], color='y', linestyle='-')
plt.legend({'Mean':bankData["Experience"].mean(),'Median':bankData["Experience"].median(),'Mode':bankData["Experience"].mode()[0]})
plt.xlabel("Experience")
plt.hist(bankData["Experience"], bins = 30)
plt.show()
bankData["Experience"].skew()
#Outliers check using Box Plot & IQR
sns.boxplot(x = 'Experience', data = bankData);
Q1 = bankData["Experience"].quantile(0.25)
Q3 = bankData["Experience"].quantile(0.75)
IQR = Q3 - Q1
print("Inter Quartile Range of Experience variable : ", IQR)
exp_outliers = np.where((bankData["Experience"] < (Q1 - 1.5 * IQR)) | (bankData["Experience"] > (Q3 + 1.5 * IQR)))
print("Outliers beyond 1.5 x IQR for Experience are : ",exp_outliers[0])
print("\n")
print("Number of outliers for Experience are : ", len(exp_outliers[0]))
print("Percentage of outliers for Experience are : ", round((len(exp_outliers[0]) / len(bankData["Experience"]) * 100),2), "%")
Observations on Experience:
Experience is normally distributed with very minimal skewness.
#Black Curve: Normal Distribution; Blue: Variable Distribution
sns.distplot(bankData["Income"], rug = True, fit=norm);
#Histogram with Mean, Median and Mode highlighted
plt.figure(figsize = (10,6))
plt.axvline(bankData["Income"].mean(), color='r', linestyle='--')
plt.axvline(bankData["Income"].median(), color='g', linestyle='-')
plt.axvline(bankData["Income"].mode()[0], color='y', linestyle='-')
plt.legend({'Mean':bankData["Income"].mean(),'Median':bankData["Income"].median(),'Mode':bankData["Income"].mode()[0]})
plt.xlabel("Income")
plt.hist(bankData["Income"], bins = 30)
plt.show()
bankData["Income"].skew()
#Outliers check using Box Plot & IQR
sns.boxplot(x = 'Income', data = bankData);
Q1 = bankData["Income"].quantile(0.25)
Q3 = bankData["Income"].quantile(0.75)
IQR = Q3 - Q1
print("Inter Quartile Range of Income variable : ", IQR)
inc_outliers = np.where((bankData["Income"] < (Q1 - 1.5 * IQR)) | (bankData["Income"] > (Q3 + 1.5 * IQR)))
print("Outliers beyond 1.5 x IQR for Income are : ",inc_outliers[0])
print("\n")
print("Number of outliers for Income are : ", len(inc_outliers[0]))
print("Percentage of outliers for Income are : ", round((len(inc_outliers[0]) / len(bankData["Income"]) * 100),2), "%")
#Treating the Skewness with Log transformation
bankData['Income'] = np.log(bankData['Income'])
print("Skewness post Log transformation : ",bankData["Income"].skew())
sns.distplot(bankData["Income"], rug = True, fit=norm)
plt.title("Income distribution post Log transformation")
plt.show()
#Outliers check using Box Plot & IQR
sns.boxplot(x = 'Income', data = bankData);
Q1 = bankData["Income"].quantile(0.25)
Q3 = bankData["Income"].quantile(0.75)
IQR = Q3 - Q1
print("Inter Quartile Range of Income variable : ", IQR)
inc_outliers = np.where((bankData["Income"] < (Q1 - 1.5 * IQR)) | (bankData["Income"] > (Q3 + 1.5 * IQR)))
print("Outliers beyond 1.5 x IQR for Income are : ",inc_outliers[0])
print("\n")
print("Number of outliers for Income (post log transform) : ", len(inc_outliers[0]))
print("Percentage of outliers for Income (post log transform) : ", round((len(inc_outliers[0]) / len(bankData["Income"]) * 100),2), "%")
Observations on Income:
Income is positively skewed in distribution with a long tail. The log transformation reduced the outliers from 2% to 1%.
#Black Curve: Normal Distribution; Blue: Variable Distribution
sns.distplot(bankData["CCAvg"], rug = True, fit=norm);
#Histogram with Mean, Median and Mode highlighted
plt.figure(figsize = (10,6))
plt.axvline(bankData["CCAvg"].mean(), color='r', linestyle='--')
plt.axvline(bankData["CCAvg"].median(), color='g', linestyle='-')
plt.axvline(bankData["CCAvg"].mode()[0], color='y', linestyle='-')
plt.legend({'Mean':bankData["CCAvg"].mean(),'Median':bankData["CCAvg"].median(),'Mode':bankData["CCAvg"].mode()[0]})
plt.xlabel("CCAvg")
plt.hist(bankData["CCAvg"], bins = 30)
plt.show()
bankData["CCAvg"].skew()
#Outliers check using Box Plot & IQR
sns.boxplot(x = 'CCAvg', data = bankData);
Q1 = bankData["CCAvg"].quantile(0.25)
Q3 = bankData["CCAvg"].quantile(0.75)
IQR = Q3 - Q1
print("Inter Quartile Range of CCAvg variable : ", IQR)
ccavg_outliers = np.where((bankData["CCAvg"] < (Q1 - 1.5 * IQR)) | (bankData["CCAvg"] > (Q3 + 1.5 * IQR)))
print("Outliers beyond 1.5 x IQR for CCAvg are : ",ccavg_outliers[0])
print("\n")
print("Number of outliers for CCAvg are : ", len(ccavg_outliers[0]))
print("Percentage of outliers for CCAvg are : ", round((len(ccavg_outliers[0]) / len(bankData["CCAvg"]) * 100),2), "%")
Here the CCAvg variable contains zero values, so a plain log transformation would fail (log(0) is undefined).
Therefore, we apply a log(x+1) transformation to avoid this.
#Treating the Skewness with Log transformation
bankData['CCAvg'] = bankData['CCAvg'].apply(lambda x : x+1)
bankData['CCAvg'] = np.log(bankData['CCAvg'])
print("Skewness post Log transformation : ",bankData["CCAvg"].skew())
sns.distplot(bankData["CCAvg"], rug = True, fit=norm)
plt.title("CCAvg distribution post Log transformation")
plt.show()
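As an aside, NumPy provides `log1p`, which computes log(1 + x) in one step and is numerically more accurate than `np.log(x + 1)` for very small x, so the two transformation lines above could be condensed:

```python
import numpy as np

# log1p(x) == log(1 + x); the two-step add-then-log version used above
# and the one-step version agree.
x = np.array([0.0, 0.5, 3.0])
shifted = np.log(x + 1)   # two-step version
direct = np.log1p(x)      # one-step equivalent
```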
#Outliers check using Box Plot & IQR
sns.boxplot(x = 'CCAvg', data = bankData);
Q1 = bankData["CCAvg"].quantile(0.25)
Q3 = bankData["CCAvg"].quantile(0.75)
IQR = Q3 - Q1
print("Inter Quartile Range of CCAvg variable : ", IQR)
ccavg_outliers = np.where((bankData["CCAvg"] < (Q1 - 1.5 * IQR)) | (bankData["CCAvg"] > (Q3 + 1.5 * IQR)))
print("Outliers beyond 1.5 x IQR for CCAvg are : ",ccavg_outliers[0])
print("\n")
print("Number of outliers for CCAvg (post log(x+1) transform) : ", len(ccavg_outliers[0]))
print("Percentage of outliers for CCAvg (post log(x+1) transform) : ", round((len(ccavg_outliers[0]) / len(bankData["CCAvg"]) * 100),2), "%")
Observations on CCAvg:
CCAvg is positively skewed in distribution with a long tail. The log(x+1) transformation reduced the skewness.
#Black Curve: Normal Distribution; Blue: Variable Distribution
sns.distplot(bankData["Mortgage"], rug = True, fit=norm);
#Histogram with Mean, Median and Mode highlighted
plt.figure(figsize = (10,6))
plt.axvline(bankData["Mortgage"].mean(), color='r', linestyle='--')
plt.axvline(bankData["Mortgage"].median(), color='g', linestyle='-')
plt.axvline(bankData["Mortgage"].mode()[0], color='y', linestyle='-')
plt.legend({'Mean':bankData["Mortgage"].mean(),'Median':bankData["Mortgage"].median(),'Mode':bankData["Mortgage"].mode()[0]})
plt.xlabel("Mortgage")
plt.hist(bankData["Mortgage"], bins = 30)
plt.show()
bankData["Mortgage"].skew()
#Outliers check using Box Plot & IQR
sns.boxplot(x = 'Mortgage', data = bankData);
Q1 = bankData["Mortgage"].quantile(0.25)
Q3 = bankData["Mortgage"].quantile(0.75)
IQR = Q3 - Q1
print("Inter Quartile Range of Mortgage variable : ", IQR)
mrtg_outliers = np.where((bankData["Mortgage"] < (Q1 - 1.5 * IQR)) | (bankData["Mortgage"] > (Q3 + 1.5 * IQR)))
print("Outliers beyond 1.5 x IQR for Mortgage are : ",mrtg_outliers[0])
print("\n")
print("Number of outliers for Mortgage are : ", len(mrtg_outliers[0]))
print("Percentage of outliers for Mortgage are : ", round((len(mrtg_outliers[0]) / len(bankData["Mortgage"]) * 100),2), "%")
Here the Mortgage variable contains zero values, so a plain log transformation would fail (log(0) is undefined). Therefore, we apply a log(x+1) transformation to avoid this.
#Treating the Skewness with Log transformation
bankData['Mortgage'] = bankData['Mortgage'].apply(lambda x : x+1)
bankData['Mortgage'] = np.log(bankData['Mortgage'])
print("Skewness post Log transformation : ",bankData["Mortgage"].skew())
sns.distplot(bankData["Mortgage"], rug = True, fit=norm)
plt.title("Mortgage distribution post Log transformation")
plt.show()
#Outliers check using Box Plot & IQR
sns.boxplot(x = 'Mortgage', data = bankData);
Q1 = bankData["Mortgage"].quantile(0.25)
Q3 = bankData["Mortgage"].quantile(0.75)
IQR = Q3 - Q1
print("Inter Quartile Range of Mortgage variable : ", IQR)
mrtg_outliers = np.where((bankData["Mortgage"] < (Q1 - 1.5 * IQR)) | (bankData["Mortgage"] > (Q3 + 1.5 * IQR)))
print("Outliers beyond 1.5 x IQR for Mortgage are : ",mrtg_outliers[0])
print("\n")
print("Number of outliers for Mortgage (post log(x+1) transform) : ", len(mrtg_outliers[0]))
print("Percentage of outliers for Mortgage (post log(x+1) transform) : ", round((len(mrtg_outliers[0]) / len(bankData["Mortgage"]) * 100),2), "%")
Observations on Mortgage:
Mortgage is positively skewed in distribution with a long tail. The log(x+1) transformation reduced the skewness.
sns.countplot(y = "Family", data = bankData, palette = "magma")
plt.ylabel("No. of Family members")
plt.xlabel("No. of records")
plt.title("Number of records based on no. of Family size")
plt.show()
#Pie Chart
family_counts = bankData['Family'].value_counts().sort_index() #sort by family size so slice labels line up
plt.pie(family_counts, labels = ['1', '2', '3', '4'], autopct='%1.1f%%', shadow=True)
plt.title("Distribution of Customers based on Family Size")
plt.show()
Observations on Family:
print(bankData['Education'].value_counts())
sns.countplot(y = "Education", data = bankData, palette = "magma")
plt.ylabel("Educational Qualification")
plt.xlabel("No. of records")
plt.title("Number of records based on no. of Educational Qualification")
plt.show()
#Pie Chart
edu_counts = bankData['Education'].value_counts().sort_index() #sort by level so slice labels line up
plt.pie(edu_counts, labels = ['Undergrad', 'Postgrad', 'Advanced/Professional'], autopct='%1.1f%%', shadow=True)
plt.title("Distribution of Customers based on Educational Qualification")
plt.show()
Observations on Education:
sns.countplot(y = "Securities Account", data = bankData, palette = "magma")
plt.ylabel("No. of Customers with Securities Account")
plt.xlabel("No. of records")
plt.title("Number of records based on Securities Account")
plt.show()
#Pie Chart
labels = ['No', 'Yes']
plt.pie(bankData['Securities Account'].value_counts(), labels = labels, autopct='%1.1f%%', shadow=True)
plt.title("Distribution of Customers based on Securities Account availability")
plt.show()
Observations on Securities Account:
sns.countplot(y = "CD Account", data = bankData, palette = "magma")
plt.ylabel("No. of Customers with CD Account")
plt.xlabel("No. of records")
plt.title("Number of records based on CD Account")
plt.show()
#Pie Chart
labels = ['No', 'Yes']
plt.pie(bankData['CD Account'].value_counts(), labels = labels, autopct='%1.1f%%', shadow=True)
plt.title("Distribution of Customers based on CD Account availability")
plt.show()
Observations on CD Account:
sns.countplot(y = "Online", data = bankData, palette = "magma")
plt.ylabel("No. of Customers with Online Bank Account")
plt.xlabel("No. of records")
plt.title("Number of records based on Online Bank Account")
plt.show()
#Pie Chart
labels = ['Yes', 'No']
plt.pie(bankData['Online'].value_counts(), labels = labels, autopct='%1.1f%%', shadow=True)
plt.title("Distribution of Customers based on Online Bank Account availability")
plt.show()
Observations on Online:
sns.countplot(y = "CreditCard", data = bankData, palette = "magma")
plt.ylabel("No. of Customers with Credit Card")
plt.xlabel("No. of records")
plt.title("Number of records based on Credit Card")
plt.show()
#Pie Chart
labels = ['No', 'Yes']
plt.pie(bankData['CreditCard'].value_counts(), labels = labels, autopct='%1.1f%%', shadow=True)
plt.title("Distribution of Customers based on CreditCard availability")
plt.show()
Observations on CreditCard:
Personal Loan
print("No. of customers who opted for Personal Loan : ", bankData['Personal Loan'].value_counts()[1])
print("No. of customers who did not opt for Personal Loan : ", bankData['Personal Loan'].value_counts()[0])
colors = ["silver", "lightgreen"]
labels = ['No', 'Yes']
plt.pie(bankData['Personal Loan'].value_counts(), labels = labels, colors = colors, explode = (0, 0.1), autopct='%1.1f%%', shadow=True)
plt.title("Distribution of Customers based on Personal Loan")
plt.show()
Let us check the variables grouped by Personal Loan
bankData.groupby('Personal Loan').mean()
Observations from Target variable distribution:
- Mean Income of customers who opted for the Personal Loan is USD 144,000, versus USD 66,000 for those who did not.
- CCAvg of customers who opted for the Personal Loan differs from those who did not.
- Education level of customers who opted for the Personal Loan is higher than that of those who did not.
- Mortgage values of customers who opted for the Personal Loan are about twice those of customers who did not.
- Customers who opted for the Personal Loan have often also opted for a CD (Certificate of Deposit) Account.
- Income, Education, CCAvg, Mortgage and CD Account seem to be important variables to include in the model.
plt.figure(figsize = (16,16))
sns.pairplot(data=bankData, diag_kind='kde')
print("Scatter Plot between all pairs of variables\n")
plt.show()
plt.figure(figsize = (16,16))
sns.pairplot(data=bankData, diag_kind='kde', hue='Personal Loan')
print("Scatter Plot between all pairs of variables based on Personal Loan subscription")
plt.show()
From the diagonal KDE plots, we can see that Income is a good predictor of the target classes due to minimal overlap. Similarly, CCAvg shows good separation in its KDE plots. However, Income and CCAvg appear to be positively correlated.
plt.figure(figsize=(15,15))
sns.heatmap(bankData.corr(), annot=True, fmt=".2f", linewidth=0.2, cmap='YlGnBu')
plt.show()
Observations from the Pairwise plots and Correlation Heatmap:
- Age and Experience are highly correlated; it is better to drop one of them.
- Income and CCAvg are positively correlated.
- Personal Loan is positively correlated with Income and moderately correlated with CCAvg and CD Account.
- Personal Loan has almost no correlation with ZIP Code, Online and CreditCard; it is better to drop them from the model.
#Distribution of Age vs Personal Loan
plt.figure(figsize=(10,6))
sns.boxplot(data=bankData, y='Age', x='Personal Loan')
plt.title("Distribution of Age for classes of Personal Loan")
plt.show()
There is no visual difference observed between Personal Loan subscribers and non-subscribers across Age groups.
#Distribution of Income vs Personal Loan
plt.figure(figsize=(10,6))
sns.violinplot(data=bankData, y='Income', x='Personal Loan')
plt.title("Distribution of Income for classes of Personal Loan")
plt.show()
The income distributions of Personal Loan subscribers and non-subscribers differ significantly: most customers in the higher income brackets opt for a Personal Loan (possibly reflecting the lifestyle needs of higher earners).
#Distribution of Family vs Personal Loan
plt.figure(figsize=(10,6))
sns.countplot(data=bankData, y='Family', hue='Personal Loan')
plt.title("Distribution of Family for classes of Personal Loan")
plt.show()
#Distribution of Family vs Personal Loan subscribers
plt.figure(figsize=(10,6))
sns.countplot(data=bankData[bankData["Personal Loan"]==1], y='Family', palette='BuGn')
plt.title("Distribution of Family for classes of Personal Loan Subscribers")
plt.show()
Customers with bigger Family size tend to subscribe more to Personal Loans.
#Distribution of CCAvg vs Personal Loan
plt.figure(figsize=(10,6))
sns.violinplot(data=bankData, y='CCAvg', x='Personal Loan')
plt.title("Distribution of CCAvg for classes of Personal Loan")
plt.show()
Customers with higher credit card spending tend to be the ones who subscribed most to the Personal Loan campaign.
#Distribution of Education vs Personal Loan
plt.figure(figsize=(10,6))
sns.countplot(data=bankData, y='Education', hue='Personal Loan')
plt.title("Distribution of Education for classes of Personal Loan")
plt.show()
#Distribution of Education vs Personal Loan subscribers
plt.figure(figsize=(10,6))
sns.countplot(data=bankData[bankData["Personal Loan"]==1], y='Education', palette='BuGn')
plt.title("Distribution of Educational Qualification for classes of Personal Loan Subscribers")
plt.show()
Customers with higher educational qualifications tend to purchase Personal Loans, though the relationship is not strong.
#Distribution of Mortgage vs Personal Loan
plt.figure(figsize=(10,6))
sns.boxplot(data=bankData, y='Mortgage', x='Personal Loan')
plt.title("Distribution of Mortgage for classes of Personal Loan")
plt.show()
No significant relationship is observed between Mortgage values and Personal Loan.
#Distribution of Securities Account vs Personal Loan
plt.figure(figsize=(10,6))
sns.countplot(data=bankData, y='Securities Account', hue='Personal Loan')
plt.title("Distribution of Securities Account for classes of Personal Loan")
plt.show()
Though customers with a Securities Account show fewer Personal Loan subscriptions, the proportion of customers without a Securities Account is so large that we cannot deduce any relationship visually.
#Distribution of Certificate of Deposit Account vs Personal Loan
plt.figure(figsize=(10,6))
sns.countplot(data=bankData, y='CD Account', hue='Personal Loan')
plt.title("Distribution of Certificate of Deposit Account for classes of Personal Loan")
plt.show()
The number of Personal Loan subscribers with a Certificate of Deposit account is almost the same as the number without one.
#Distibution of Income vs Family vs Personal Loan
plt.figure(figsize=(10,6))
sns.boxplot(x='Family', y='Income', hue='Personal Loan', data=bankData, palette = 'Set2');
#Distibution of Income vs Education vs Personal Loan
plt.figure(figsize=(10,6))
sns.boxplot(x='Education', y='Income', hue='Personal Loan', data=bankData, palette='BuGn');
We won't be visualizing the plots for Online and CreditCard since their correlations with Personal Loan are almost 0.
Dropping the ID, Experience and ZIP Code columns from the dataset.
bank = bankData.drop(['ID', 'Experience','ZIP Code'], axis=1)
bank.head(3)
X = bank.drop('Personal Loan', axis=1)
y = bank['Personal Loan']
print(X.shape)
print(y.shape)
#Stratify preserves the proportion of target classes in the splits, since the classes are imbalanced
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state = 24)
display(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
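What stratification buys us can be verified on a toy target with the same ~9.6% positive rate as the loan data (the names `X_toy`/`y_toy` are illustrative only):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target: 96 positives out of 1000, mimicking the loan data's class ratio
y_toy = pd.Series([1] * 96 + [0] * 904)
X_toy = pd.DataFrame({'f': range(len(y_toy))})
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy,
                                          test_size=0.3, random_state=24)
# Both splits retain approximately the original positive rate
print(y_tr.mean().round(3), y_te.mean().round(3))
```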
Dataset will be scaled so that different features with different metric units are comparable. Also, since we will be using K-Nearest Neighbors algorithm which is a distance-based model, it becomes imperative to do scaling prior to model building.
#StandardScaler from Scikit Learn
scaler = StandardScaler()
#Fitting the Training set
scaler.fit(X_train)
StandardScaler has been fitted on X_train (learning its mean and standard deviation). We use the fitted instance to 'transform' both the train and test sets, so that no information from X_test leaks into training. Note that transform returns new arrays, so the results must be assigned back (wrapping them in DataFrames preserves the column names for statsmodels later).
X_train = pd.DataFrame(scaler.transform(X_train), columns = X_train.columns, index = X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns = X_test.columns, index = X_test.index)
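A common pattern that makes this train-only fitting automatic is to wrap the scaler and the model in a scikit-learn Pipeline; a minimal sketch (not part of the original notebook):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# The pipeline fits the scaler on the training data only; at predict time the
# same train-derived mean/std is applied to incoming data automatically.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
```

Calling `pipe.fit(X_train, y_train)` then scales and fits in one step, and `pipe.predict(X_test)` applies the training-set scaling before predicting.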
We will be using the below Classifiers:
- Logistic Regression - plain vanilla model
- Logistic Regression - class weighted
- Logistic Regression - Hyper parameter tuned
- Logistic Regression - Statsmodel.api
- KNN - Default values of n_neighbor = 5
- KNN - Hyper parameter tuned
log_reg_ml = LogisticRegression(solver = 'lbfgs', max_iter=500) #lbfgs is the default solver
log_reg_ml.fit(X_train, y_train)
print('Logistic Regression Model\n\n')
print('Logistic Regression Model Accuracy for train set: {0:.3f}'.format(log_reg_ml.score(X_train, y_train)))
print('Logistic Regression Model Accuracy for test set: {0:.3f}'.format(log_reg_ml.score(X_test, y_test)))
y_true, y_pred = y_test, log_reg_ml.predict(X_test)
# Classification Report
print("\nClassification Report")
print('\n{}'.format(classification_report(y_true, y_pred)))
# Confusion Matrix
log_reg_cm = confusion_matrix(y_true, y_pred)
print('\nConfusion Matrix:\n', log_reg_cm)
# Accuracy Score
log_reg_acc = accuracy_score(y_true, y_pred)
print('\nAccuracy Score:\n', log_reg_acc.round(3))
# Precision Score
log_reg_precision = precision_score(y_true, y_pred)
print('\nPrecision Score:\n', log_reg_precision.round(3))
# Recall Score
log_reg_recall = recall_score(y_true, y_pred)
print('\nRecall Score:\n', log_reg_recall.round(3))
# F1 Score
log_reg_f1 = f1_score(y_true, y_pred)
print('\nF1 Score:\n', log_reg_f1.round(3))
# ROC AUC Score (on predicted probabilities, not hard labels)
log_reg_roc_auc = roc_auc_score(y_true, log_reg_ml.predict_proba(X_test)[:,1])
print('\nROC AUC Score:\n', log_reg_roc_auc.round(3))
#Gini-Coefficient
log_reg_gini_coeff = 2 * log_reg_roc_auc -1
print("\nGini Coefficient :\n", log_reg_gini_coeff.round(2))
#Log Loss
log_loss_logr = log_loss(y_true, y_pred)
print("\nLog Loss :\n", log_loss_logr.round(2))
fpr_lr, tpr_lr, thresholds_lr = roc_curve(y_true, log_reg_ml.predict_proba(X_test)[:,1])
plt.figure(figsize = (14 , 6))
plt.plot(fpr_lr, tpr_lr, label = 'Logistic Regression (area = {})'.format(log_reg_roc_auc.round(2)))
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc = 'lower right')
plt.show()
#Note: sm.Logit does not add an intercept automatically; wrap X_train in sm.add_constant() to include one
log_reg_sm = sm.Logit(y_train, X_train)
log_reg_sm_ml = log_reg_sm.fit()
log_reg_sm_ml.summary2()
def get_significant_features(lm):
var_p_vals = pd.DataFrame(lm.pvalues)
var_p_vals['vars'] = var_p_vals.index
var_p_vals.columns = ['p_vals','vars']
return list(var_p_vals[var_p_vals['p_vals'] <= 0.05]['vars'])
significant_vars = get_significant_features(log_reg_sm_ml)
significant_vars
log_reg_sm2 = sm.Logit(y_train, X_train[significant_vars])
log_reg_sm_ml2 = log_reg_sm2.fit()
log_reg_sm_ml2.summary2()
LLR p-value: 1.7356e-90 indicates that the Logistic Regression model built is significant.
log_reg_ml_wt = LogisticRegression(solver = 'lbfgs', max_iter=300, class_weight='balanced')
log_reg_ml_wt.fit(X_train, y_train)
Note:
The “balanced” mode uses the values of y to automatically adjust weights
inversely proportional to class frequencies in the input data as
n_samples / (n_classes * np.bincount(y)).
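For this dataset's class ratio (480 positives out of 5000), the formula above works out as follows; a quick numeric sketch:

```python
import numpy as np

# 5000 samples, 2 classes, 480 positives -- as in this dataset
y = np.array([0] * 4520 + [1] * 480)
weights = len(y) / (2 * np.bincount(y))   # n_samples / (n_classes * bincount)
print(weights)
```

The minority class thus receives a weight roughly ten times that of the majority class, which is what pushes the class-weighted model toward higher recall on loan subscribers.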
print('Logistic Regression Model Class Weights Balanced \n\n')
print('Logistic Regression Model Accuracy for train set: {0:.3f}'.format(log_reg_ml_wt.score(X_train, y_train)))
print('Logistic Regression Model Accuracy for test set: {0:.3f}'.format(log_reg_ml_wt.score(X_test, y_test)))
print("\n")
y_true, y_pred = y_test, log_reg_ml_wt.predict(X_test)
# Classification Report
print("Classification Report\n")
print('\n{}'.format(classification_report(y_true, y_pred)))
# Confusion Matrix
log_reg_wt_cm = confusion_matrix(y_true, y_pred)
print('\nConfusion Matrix:\n', log_reg_wt_cm)
# Accuracy Score
log_reg_wt_acc = accuracy_score(y_true, y_pred)
print('\nAccuracy Score:\n', log_reg_wt_acc.round(3))
# Precision Score
log_reg_wt_precision = precision_score(y_true, y_pred)
print('\nPrecision Score:\n', log_reg_wt_precision.round(3))
# Recall Score
log_reg_wt_recall = recall_score(y_true, y_pred)
print('\nRecall Score:\n', log_reg_wt_recall.round(3))
# F1 Score
log_reg_wt_f1 = f1_score(y_true, y_pred)
print('\nF1 Score:\n', log_reg_wt_f1.round(3))
# ROC AUC Score (on predicted probabilities, not hard labels)
log_reg_wt_roc_auc = roc_auc_score(y_true, log_reg_ml_wt.predict_proba(X_test)[:,1])
print('\nROC AUC Score:\n', log_reg_wt_roc_auc.round(3))
#Gini-Coefficient
log_reg_wt_gini_coeff = 2 * log_reg_wt_roc_auc - 1
print("\nGini Coefficient :\n", log_reg_wt_gini_coeff.round(2))
#Log Loss
log_loss_class_wt = log_loss(y_true, y_pred)
print("\nLog Loss :\n", log_loss_class_wt.round(2))
fpr_lr_wt, tpr_lr_wt, thresholds_lr_wt = roc_curve(y_true, log_reg_ml_wt.predict_proba(X_test)[:,1])
plt.figure(figsize = (14 , 6))
plt.plot(fpr_lr_wt, tpr_lr_wt, label = 'Logistic Regression (class weighted) (area = {})'.format(log_reg_wt_roc_auc.round(2)))
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc = 'lower right')
plt.show()
Let us try to find the best parameters using GridSearchCV (Hyperparameter tuning)
# Logistic Regression with hyperparameter tuning
log_reg = LogisticRegression(random_state = 24)
params = {'penalty': ['l1', 'l2'],
'C': [0.001, 0.01, 0.1, 1, 10, 100],
'solver' : ['lbfgs', 'liblinear','newton-cg', 'sag', 'saga'], #not every solver supports every penalty (e.g. 'lbfgs' has no 'l1'); GridSearchCV scores such invalid combinations as NaN
'class_weight' : [{0:100,1:1}, {0:10,1:1}, {0:1,1:1}, {0:1,1:10}, {0:1,1:100}],
'max_iter': [50, 100, 150, 200]}
log_reg_tuned = GridSearchCV(log_reg, param_grid = params, scoring = 'roc_auc', n_jobs = -1, cv = StratifiedKFold(n_splits = 10))
#ROC_AUC_Score is used for Scoring since the given dataset has Class Imbalance.
log_reg_tuned.fit(X_train, y_train)
print('Logistic Regression Scores with Hyperparameter Tuning\n\n')
print('Best Estimator : \n', log_reg_tuned.best_estimator_)
print('Best Hyper Parameters : \n', log_reg_tuned.best_params_)
print('Best Score : \n', log_reg_tuned.best_score_.round(3))
#print('Cross Validation Results:\n', log_reg_tuned.cv_results_ )
print('Logistic Regression (tuned) accuracy for train set: {0:.3f}'.format(log_reg_tuned.score(X_train, y_train)))
print('Logistic Regression (tuned) accuracy for test set: {0:.3f}'.format(log_reg_tuned.score(X_test, y_test)))
y_true, y_pred = y_test, log_reg_tuned.predict(X_test)
# Classification Report
print("\n")
print("Classification Report")
print('\n{}'.format(classification_report(y_true, y_pred)))
# Confusion Matrix
log_reg_tuned_cm = confusion_matrix(y_true, y_pred)
print('\nConfusion Matrix:\n', log_reg_tuned_cm)
# Accuracy Score
log_reg_tuned_accuracy = accuracy_score(y_true, y_pred)
print('\nAccuracy Score:\n', log_reg_tuned_accuracy.round(3))
# Precision Score
log_reg_tuned_precision = precision_score(y_true, y_pred)
print('\nPrecision Score:\n', log_reg_tuned_precision.round(3))
# Recall Score
log_reg_tuned_recall = recall_score(y_true, y_pred)
print('\nRecall Score:\n', log_reg_tuned_recall.round(3))
# F1 Score
log_reg_tuned_f1 = f1_score(y_true, y_pred)
print('\nF1 Score:\n', log_reg_tuned_f1.round(3))
# ROC AUC Score (on predicted probabilities, not hard labels)
log_reg_tuned_roc_auc = roc_auc_score(y_true, log_reg_tuned.predict_proba(X_test)[:,1])
print('\nROC AUC Score:\n', log_reg_tuned_roc_auc.round(3))
#Gini-Coefficient
log_reg_tuned_gini_coeff = 2 * log_reg_tuned_roc_auc -1
print("\nGini Coefficient :\n", log_reg_tuned_gini_coeff.round(2))
#Log Loss (requires predicted probabilities, not hard labels)
log_loss_tuned = log_loss(y_true, log_reg_tuned.predict_proba(X_test))
print("\nLog Loss :\n", log_loss_tuned.round(2))
fpr2, tpr2, thresholds2 = roc_curve(y_true, log_reg_tuned.predict_proba(X_test)[:,1])
plt.figure(figsize = (14, 6))
plt.plot(fpr2, tpr2, label = 'Logistic Regression with Hyperparameter Tuning (area = {})'.format(log_reg_tuned_roc_auc.round(2)))
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc = 'lower right')
plt.show()
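The Gini coefficient reported above is just a rescaling of ROC AUC (Gini = 2 * AUC - 1), so the two always move together. A minimal check on toy labels and scores (not the Bank.xls data):

```python
# Toy illustration: Gini = 2 * AUC - 1.
from sklearn.metrics import roc_auc_score

y_toy = [0, 0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8, 0.3]

auc = roc_auc_score(y_toy, s_toy)   # 4 of 6 positive/negative pairs are ranked correctly
gini = 2 * auc - 1
print(auc, gini)                    # 0.667 and 0.333 (rounded)
```

A perfect ranker has AUC = 1 and Gini = 1; a random ranker has AUC = 0.5 and Gini = 0.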
plt.figure(figsize = (14 , 8))
#Plain vanilla Model
plt.plot(fpr_lr, tpr_lr, label = 'Logistic Regression (area = {})'.format(log_reg_roc_auc.round(2)))
#Class-weighted
plt.plot(fpr_lr_wt, tpr_lr_wt, label = 'Logistic Regression (class weighted) (area = {})'.format(log_reg_wt_roc_auc.round(2)))
#Hyper Parameter Tuned
plt.plot(fpr2, tpr2, label = 'Logistic Regression with Hyperparameter Tuning (area = {})'.\
format(log_reg_tuned_roc_auc.round(2)))
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc = 'lower right')
plt.show()
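Each ROC curve above is a family of operating points, not a single classifier. One common (hedged) way to pick a single threshold from it is Youden's J statistic, J = TPR - FPR. A self-contained sketch on synthetic data (make_classification stands in for the notebook's features):

```python
# Sketch: choose an operating threshold from a ROC curve via Youden's J.
# Synthetic imbalanced data, assumed only for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
fpr, tpr, thresholds = roc_curve(y_te, clf.predict_proba(X_te)[:, 1])

best = np.argmax(tpr - fpr)          # index maximising Youden's J
print("chosen threshold:", thresholds[best])
```

The same recipe would apply to `fpr2, tpr2, thresholds2` computed for the tuned logistic model above.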
# Heuristic starting point: k = sqrt(number of features).
# (k = sqrt(number of samples) is another common rule of thumb.)
n = int(np.sqrt(X_train.shape[1]))
print("No of Neighbors is : ", n)
knn_ml = KNeighborsClassifier(n_neighbors=n)
knn_ml.fit(X_train, y_train)
print('KNN Classifier Model\n\n')
print('KNN Classifier Model Accuracy for train set: {0:.3f}'.format(knn_ml.score(X_train, y_train)))
print('KNN Classifier Model Accuracy for test set: {0:.3f}'.format(knn_ml.score(X_test, y_test)))
y_true, y_pred = y_test, knn_ml.predict(X_test)
# Classification Report
print('\nClassification Report')
print('\n{}'.format(classification_report(y_true, y_pred)))
# Confusion Matrix
knn_cm = confusion_matrix(y_true, y_pred)
print('\nConfusion Matrix:\n', knn_cm)
# Accuracy Score
knn_acc = accuracy_score(y_true, y_pred)
print('\nAccuracy Score:\n', knn_acc.round(3))
# Precision Score
knn_precision = precision_score(y_true, y_pred)
print('\nPrecision Score:\n', knn_precision.round(3))
# Recall Score
knn_recall = recall_score(y_true, y_pred)
print('\nRecall Score:\n', knn_recall.round(3))
# F1 Score
knn_f1 = f1_score(y_true, y_pred)
print('\nF1 Score:\n', knn_f1.round(3))
# ROC AUC Score (computed on predicted probabilities, not hard labels)
knn_roc_auc = roc_auc_score(y_true, knn_ml.predict_proba(X_test)[:, 1])
print('\nROC AUC Score:\n', knn_roc_auc.round(3))
#Gini-Coefficient
knn_gini_coeff = 2 * knn_roc_auc -1
print("\nGini Coefficient :\n", knn_gini_coeff.round(2))
fpr_knn, tpr_knn, thresholds_knn = roc_curve(y_true, knn_ml.predict_proba(X_test)[:,1])
plt.figure(figsize = (14 , 6))
plt.plot(fpr_knn, tpr_knn, label = 'KNN Classifier (area = {})'.format(knn_roc_auc.round(2)))
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc = 'lower right')
plt.show()
# KNN with hyperparameter tuning
knn = KNeighborsClassifier()
params = {'n_neighbors': list(range(1, 30, 2)), 'weights': ['uniform', 'distance'],
'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
}
knn_tuned = GridSearchCV(knn, param_grid = params, n_jobs = -1, scoring = 'roc_auc', cv = StratifiedKFold(n_splits = 10))
knn_tuned.fit(X_train, y_train)
print('KNN Scores with Hyperparameter Tuning\n\n')
print('Best Estimator :', knn_tuned.best_estimator_)
print('Best Hyper Parameters : ', knn_tuned.best_params_)
print('Best Score : ', knn_tuned.best_score_.round(3))
print('KNN (tuned) accuracy for train set: {0:.3f}'.format(knn_tuned.score(X_train, y_train)))
print('KNN (tuned) accuracy for test set: {0:.3f}'.format(knn_tuned.score(X_test, y_test)))
y_true, y_pred = y_test, knn_tuned.predict(X_test)
# Classification Report
print("\n")
print("Classification Report")
print('\n{}'.format(classification_report(y_true, y_pred)))
# Confusion Matrix
knn_tuned_cm = confusion_matrix(y_true, y_pred)
print('\nConfusion Matrix:\n', knn_tuned_cm)
# Accuracy Score
knn_tuned_accuracy = accuracy_score(y_true, y_pred)
print('\nAccuracy Score:\n', knn_tuned_accuracy.round(3))
# Precision Score
knn_tuned_precision = precision_score(y_true, y_pred)
print('\nPrecision Score:\n', knn_tuned_precision.round(3))
# Recall Score
knn_tuned_recall = recall_score(y_true, y_pred)
print('\nRecall Score:\n', knn_tuned_recall.round(3))
# F1 Score
knn_tuned_f1 = f1_score(y_true, y_pred)
print('\nF1 Score:\n', knn_tuned_f1.round(3))
# ROC AUC Score (computed on predicted probabilities, not hard labels)
knn_tuned_roc_auc = roc_auc_score(y_true, knn_tuned.predict_proba(X_test)[:, 1])
print('\nROC AUC Score:\n', knn_tuned_roc_auc.round(3))
#Gini-Coefficient
knn_tuned_gini_coeff = 2 * knn_tuned_roc_auc -1
print("\nGini Coefficient :\n", knn_tuned_gini_coeff.round(2))
fpr_knn_tuned, tpr_knn_tuned, thresholds_knn_tuned = roc_curve(y_true, knn_tuned.predict_proba(X_test)[:,1])
plt.figure(figsize = (14, 6))
plt.plot(fpr_knn_tuned, tpr_knn_tuned, label = 'KNN with Hyperparameter Tuning (area = {})'.format(knn_tuned_roc_auc.round(2)))
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc = 'lower right')
plt.show()
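Because KNN votes by Euclidean distance, features on large scales dominate the neighbour search unless the data are standardised first. A minimal sketch of this effect, assuming illustrative synthetic data rather than Bank.xls:

```python
# Sketch: KNN with vs without feature scaling.
# One feature is inflated so raw distances are dominated by it.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, random_state=0)
X[:, 0] *= 1000  # blow up one feature's scale
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr).score(X_te, y_te)
print("unscaled accuracy:", raw, " scaled accuracy:", scaled)
```

This is why KNN results in the notebook depend on the scaling applied during preprocessing.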
Let us find the optimal k-value for KNN using the model performance metrics and error plot.
# Candidate k values: odd numbers only, so the neighbor vote cannot tie in this binary problem.
# k = 1 is omitted because a 1-NN model memorises the training set and overfits.
k_list = list(range(3, 20, 2))
k_list
# List to store ROC AUC Scores
roc_auc_scores = []
# Fit and predict for each candidate k (3, 5, ..., 19)
for k in k_list:
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
scores = roc_auc_score(y_test, y_pred) #Use ROC AUC instead of Accuracy Score because of class imbalance
roc_auc_scores.append(scores)
# Optimal k is the one with the maximum ROC AUC score
optimal_k = k_list[roc_auc_scores.index(max(roc_auc_scores))]
print("The optimal number of neighbors is %d" % optimal_k)
# Plot ROC AUC vs k
plt.plot(k_list, roc_auc_scores)
plt.xlabel('Number of Neighbors K')
plt.ylabel('ROC AUC Score')
plt.show()
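One caveat with the loop above: it selects k by scoring on the test set, which leaks test information into model selection. A safer sketch keeps the test set untouched and cross-validates on the training data only (synthetic data here, assumed for illustration):

```python
# Sketch: choose k by cross-validated ROC AUC on training data only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=3)

k_list = list(range(3, 20, 2))
cv_auc = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y,
                          scoring='roc_auc', cv=5).mean()
          for k in k_list]
best_k = k_list[int(np.argmax(cv_auc))]
print("best k by cross-validation:", best_k)
```

This is essentially what `GridSearchCV` with `cv=StratifiedKFold(...)` already does for us above.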
From GridSearch:
Best Estimator : KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=1, p=2, weights='uniform')
#From Optimal K-value calculation
print("The optimal number of neighbors is %d" % optimal_k)
From the error plot, we can cross-check the best parameters found in the hyperparameter tuning exercise.
KNN scores with hyperparameter tuning, best hyperparameters: {'algorithm': 'auto', 'n_neighbors': 1, 'weights': 'uniform'}
gnb_ml = GaussianNB()
gnb_ml.fit(X_train, y_train)
print('Gaussian Naive Bayes Classifier Model\n\n')
print('Gaussian Naive Bayes Classifier Model Accuracy for train set: {0:.3f}'.format(gnb_ml.score(X_train, y_train)))
print('Gaussian Naive Bayes Classifier Model Accuracy for test set: {0:.3f}'.format(gnb_ml.score(X_test, y_test)))
y_true, y_pred = y_test, gnb_ml.predict(X_test)
# Classification Report
print("\nClassification Report")
print('\n{}'.format(classification_report(y_true, y_pred)))
# Confusion Matrix
gnb_cm = confusion_matrix(y_true, y_pred)
print('\nConfusion Matrix:\n', gnb_cm)
# Accuracy Score
gnb_acc = accuracy_score(y_true, y_pred)
print('\nAccuracy Score:\n', gnb_acc.round(3))
# Precision Score
gnb_precision = precision_score(y_true, y_pred)
print('\nPrecision Score:\n', gnb_precision.round(3))
# Recall Score
gnb_recall = recall_score(y_true, y_pred)
print('\nRecall Score:\n', gnb_recall.round(3))
# F1 Score
gnb_f1 = f1_score(y_true, y_pred)
print('\nF1 Score:\n', gnb_f1.round(3))
# ROC AUC Score (computed on predicted probabilities, not hard labels)
gnb_roc_auc = roc_auc_score(y_true, gnb_ml.predict_proba(X_test)[:, 1])
print('\nROC AUC Score:\n', gnb_roc_auc.round(3))
#Gini-Coefficient
gnb_gini_coeff = 2 * gnb_roc_auc -1
print("\nGini Coefficient :\n", gnb_gini_coeff.round(2))
fpr_gnb, tpr_gnb, thresholds_gnb = roc_curve(y_true, gnb_ml.predict_proba(X_test)[:,1])
plt.figure(figsize = (12.8 , 6))
plt.plot(fpr_gnb, tpr_gnb, label = 'Gaussian Naive Bayes Classifier (area = {})'.format(gnb_roc_auc.round(2)))
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc = 'lower right')
plt.show()
Since the dataset is class-imbalanced, accuracy alone is not a good measure of model performance. We therefore evaluate the models on the confusion-matrix-based measures below (precision, recall, F1, ROC AUC):
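To make the accuracy pitfall concrete: with only 9.6% positives (480 of 5000, as in Bank.xls), a model that always predicts "no loan" scores about 90% accuracy yet finds no buyers at all.

```python
# Why accuracy misleads on imbalanced data: a majority-class predictor.
# Toy labels mirror the 480/5000 positive split in the dataset.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_all = np.array([1] * 480 + [0] * 4520)  # 9.6% positives
y_hat = np.zeros_like(y_all)              # always predict the majority class

acc = accuracy_score(y_all, y_hat)
rec = recall_score(y_all, y_hat)
print(acc, rec)  # 0.904 0.0
```

High accuracy, zero recall: exactly the failure mode that precision, recall, F1 and ROC AUC expose.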
sns.heatmap(log_reg_cm, annot=True, fmt="2d", square=True, linewidth=0.2, cmap='YlGnBu')
plt.title("Confusion Matrix - Logistic Regression")
plt.show()
sns.heatmap(log_reg_tuned_cm, annot=True, fmt="2d", square=True, linewidth=0.2, cmap='YlGnBu')
plt.title("Confusion Matrix - Logistic Regression - Hyperparameter Tuned")
plt.show()
sns.heatmap(log_reg_wt_cm, annot=True, fmt="2d", square=True, linewidth=0.2, cmap='YlGnBu')
plt.title("Confusion Matrix - Logistic Regression - Class Weighted")
plt.show()
sns.heatmap(knn_tuned_cm, annot=True, fmt="2d", square=True, linewidth=0.2, cmap='YlGnBu')
plt.title("Confusion Matrix - K Nearest Neighbors - Hyperparameter tuned")
plt.show()
sns.heatmap(gnb_cm, annot=True, fmt="2d", square=True, linewidth=0.2, cmap='YlGnBu')
plt.title("Confusion Matrix - Gaussian Naive Bayes")
plt.show()
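The precision and recall figures quoted below come straight out of these 2x2 matrices. A small self-contained check with toy labels (not the notebook's results) that precision = TP/(TP+FP) and recall = TP/(TP+FN):

```python
# Precision and recall unpacked by hand from a confusion matrix.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
p_toy = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_toy, p_toy).ravel()
print(tp / (tp + fp), precision_score(y_toy, p_toy))  # identical values
print(tp / (tp + fn), recall_score(y_toy, p_toy))     # identical values
```

For the loan campaign, recall (TP/(TP+FN)) is the share of actual loan buyers the model manages to find, which is why it gets special attention below.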
metrics = {'Classifier':['Logistic Regression Class Weighted','Logistic Regression HyperParameter Tuned','K-Nearest Neighbors HyperParameter Tuned','Gaussian Naive Bayes'],
           'Accuracy' : [log_reg_wt_acc, log_reg_tuned_accuracy, knn_tuned_accuracy, gnb_acc],
           'Precision':[log_reg_wt_precision, log_reg_tuned_precision, knn_tuned_precision, gnb_precision],
           'Recall':[log_reg_wt_recall, log_reg_tuned_recall, knn_tuned_recall, gnb_recall],
           'F1-Score':[log_reg_wt_f1, log_reg_tuned_f1, knn_tuned_f1, gnb_f1],
           'ROC AUC':[log_reg_wt_roc_auc, log_reg_tuned_roc_auc, knn_tuned_roc_auc, gnb_roc_auc],
           'Gini Coeff':[log_reg_wt_gini_coeff, log_reg_tuned_gini_coeff, knn_tuned_gini_coeff, gnb_gini_coeff],
           # Log loss was computed only for the two logistic models; NaN marks the others
           'Log Loss':[log_loss_class_wt, log_loss_tuned, np.nan, np.nan]
          }
model_eval_metrics = pd.DataFrame(metrics)
model_eval_metrics = model_eval_metrics.set_index('Classifier')
model_eval_metrics = model_eval_metrics.apply(lambda x: x.round(3))
model_eval_metrics
From the confusion matrices, we can see that the hyperparameter-tuned Logistic Regression provides the best balance between Precision (TP/(TP+FP)) and Recall (TP/(TP+FN)), especially on the recall numbers.
Since the dataset has class imbalance, accuracy is not a good measure. The metrics we rely on for model selection are:
plt.figure(figsize = (14 , 8))
#Logistic Regression Plain vanilla Model
plt.plot(fpr_lr, tpr_lr, label = 'Logistic Regression (area = {})'.format(log_reg_roc_auc.round(2)))
#Logistic Regression Class-weighted
plt.plot(fpr_lr_wt, tpr_lr_wt, label = 'Logistic Regression (class weighted) (area = {})'.format(log_reg_wt_roc_auc.round(2)))
# Logistic Regression Hyper Parameter Tuned
plt.plot(fpr2, tpr2, label = 'Logistic Regression with Hyperparameter Tuning (area = {})'.\
format(log_reg_tuned_roc_auc.round(2)))
#KNN with No of Neighbors = sqrt(No of Features)
plt.plot(fpr_knn, tpr_knn, label = 'KNN Classifier (area = {})'.format(knn_roc_auc.round(2)))
#KNN Hyper Parameter Tuned
plt.plot(fpr_knn_tuned, tpr_knn_tuned, label = 'KNN with Hyperparameter Tuning (area = {})'.\
format(knn_tuned_roc_auc.round(2)))
#Gaussian Naive Bayes
plt.plot(fpr_gnb, tpr_gnb, label = 'Gaussian Naive Bayes Classifier (area = {})'.format(gnb_roc_auc.round(2)))
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC AUC - Log Reg vs KNN vs GNB')
plt.legend(loc = 'lower right')
plt.show()
#Precision-Recall Curve - Logistic Regression with Hyperparameter tuned
y_scores_log_reg_tuned = log_reg_tuned.decision_function(X_train)
precisions_log_reg_tuned, recalls_log_reg_tuned, thresholds_log_reg_tuned = precision_recall_curve(y_train, y_scores_log_reg_tuned)
#Precision-Recall Curve - Logistic Regression with Class Weights
y_scores_log_reg_wt = log_reg_ml_wt.decision_function(X_train)
precisions_log_reg_wt, recalls_log_reg_wt, thresholds_log_reg_wt = precision_recall_curve(y_train, y_scores_log_reg_wt)
plt.figure(figsize=(8, 6))
plt.plot(recalls_log_reg_wt, precisions_log_reg_wt, "r-", linewidth=2)
plt.plot(recalls_log_reg_tuned, precisions_log_reg_tuned, "b-", linewidth=2)
plt.xlabel("Recall", fontsize=16)
plt.ylabel("Precision", fontsize=16)
plt.axis([0, 1, 0, 1])
plt.grid(True)
plt.plot([0.0, 0.9], [0.45, 0.45], "r:")
plt.plot([0.9, 0.9], [0.0, 0.45], "r:")
plt.plot([0.9], [0.45], "ro")
plt.show()
From the Precision vs Recall curve for both Logistic Regression classifiers, we can see that:
Source: https://scikit-learn.org/stable/modules/calibration.html#calibration
# Plot calibration plots
from sklearn.calibration import calibration_curve
plt.figure(figsize=(10, 10))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax2 = plt.subplot2grid((3, 1), (2, 0))
ax1.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
for clf, name in [(log_reg_ml_wt, 'Logistic'),
(log_reg_tuned, 'Logistic tuned'),
(knn_tuned, 'KNN'),
(gnb_ml, 'Gaussian Naive Bayes')]:
clf.fit(X_train, y_train)
if hasattr(clf, "predict_proba"):
prob_pos = clf.predict_proba(X_test)[:, 1]
else: # use decision function
prob_pos = clf.decision_function(X_test)
prob_pos = (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())
fraction_of_positives, mean_predicted_value = calibration_curve(y_test, prob_pos, n_bins=10)
ax1.plot(mean_predicted_value, fraction_of_positives, "s-",
label="%s" % (name, ))
ax2.hist(prob_pos, range=(0, 1), bins=10, label=name,
histtype="step", lw=2)
ax1.set_ylabel("Fraction of positives")
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc="lower right")
ax1.set_title('Calibration plots (reliability curve)')
ax2.set_xlabel("Mean predicted value")
ax2.set_ylabel("Count")
ax2.legend(loc="upper center", ncol=2)
plt.tight_layout()
plt.show()
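A one-number companion to the reliability curves above is the Brier score: the mean squared error between predicted probabilities and outcomes, where lower is better. A tiny self-contained check (toy labels, not the notebook's models):

```python
# Brier score: 0.0 for perfectly confident correct probabilities,
# 0.25 for maximally hedged 0.5 predictions on binary labels.
from sklearn.metrics import brier_score_loss

y_toy = [0, 1, 1, 0]
perfect = brier_score_loss(y_toy, [0.0, 1.0, 1.0, 0.0])
hedged = brier_score_loss(y_toy, [0.5, 0.5, 0.5, 0.5])
print(perfect, hedged)  # 0.0 0.25
```

Unlike the reliability curve it does not show where a model is miscalibrated, but it is convenient for ranking models in a summary table.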
model_eval_metrics[['Accuracy','Precision','Recall','F1-Score','ROC AUC', 'Gini Coeff']]
print("Best Performance Model :\n")
model_eval_metrics[['Accuracy','Precision','Recall','F1-Score','ROC AUC', 'Gini Coeff']].idxmax()
Based on the Precision vs Recall curve for both Logistic Regression classifiers, we can see that:
Based on Log Loss:
An additional advantage of the Logistic Regression model is that it outputs a probability with each prediction, which can be used to adjust the decision threshold and fine-tune predictions further.
Therefore, we recommend the Logistic Regression classifier for the given dataset.
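The threshold tuning mentioned above can be sketched briefly: lowering the decision threshold below the default 0.5 trades precision for recall without retraining, which matters when missed loan buyers (false negatives) are costly. Synthetic data is assumed here in place of Bank.xls:

```python
# Sketch: moving the decision threshold on predict_proba output.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

recall_at = {}
for t in (0.5, 0.3, 0.1):
    y_hat = (proba >= t).astype(int)   # default predict() uses t = 0.5
    recall_at[t] = recall_score(y_te, y_hat)
    print("threshold", t, "-> recall", round(recall_at[t], 3))
```

Recall can only rise (or stay flat) as the threshold falls, since lowering it adds predicted positives; the price is paid in precision.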
**End of Assignment**